ABSTRACT

As an alternative to a classical test theory basis
for criterion-referenced test construction, it is proposed that a
strict item-sampling model be used. The computer's role in such a
model is outlined. The assumptions of the model are carefully defined
and its properties reviewed. The relationship between mastery
criteria and such sampling plans as single sampling, simple curtailed
sampling, and the use of the sequential probability ratio test is
discussed. Representative operating characteristic curves for a
number of different plans are included. Suggestions are offered for
reducing the testing time needed to detect mastery attainment levels
which are consistent with the Neyman-Pearson theory of hypothesis
testing. Applications are indicated, and an example included, in the
area of computer-generation and administration of
criterion-referenced tests of mastery in selected arithmetic skills.
(DG)

THE DEVELOPMENT AND INTERPRETATION OF CRITERION-REFERENCED TESTS

The particular focus of this paper is the relationship between the
design of criterion-referenced tests and their use in instructional
management systems.

Criterion-referenced tests have been used in a variety of forms over
the last three decades as tools for educational measurement (Birnbaum, 1958;
Hammock, 1960; Ebel, 1962; Lord and Novick, 1968). The different types of
criterion-referenced tests vary considerably from one to another in terms
of the underlying model of design. Nevertheless, there is a common point
of application for such tests: to classify examinees according to higher
or lower ability as their observed score exceeds or falls short of a given
criterion value. This application is used in instructional management
systems to separate learning groups into instructional subgroups in which
the instructional treatment can be better fitted to the relatively
homogeneous ability exhibited by members of each subgroup.

A second property of criterion-referenced tests of use in instructional 
management systems is that, when properly designed, such tests can give 
estimates of an individual's absolute level of proficiency, rather than 
the relative estimate provided by classical norm-referenced tests. 

A third feature of special interest in computer managed systems is
the adaptability of the criterion-referenced test to the possibility of
computer generation of test items and thereby to the development of
relatively economical and efficient decision systems. The model for
developing and interpreting criterion-referenced tests that is to be
described in this paper suggests a role for the computer similar to that of
the industrial quality control sampling inspector.

In rough outline, the idea is to specify classes of problems for
which adequately reliable problem solving behaviors are to be developed
in the individual by the instructional system. When the "product" (i.e.,
the individual's specific set of problem solving behaviors) is ready for
inspection or test, the computer generates a random sample of problems
from a specified population of items, possibly by using item-generation
rules. The items are assumed to be of equal importance. They are also
assumed to be approximately of equal difficulty, in the sense that the
individual has about the same probability of success on any item randomly
selected from the population. The absolute level of this probability is
assumed to be a function of the degree to which the examinee has developed
an appropriate set of problem-solving behaviors. The computer's potential
role in this scheme is not only to generate items, check responses, and
keep records automatically. It also can effectively administer the sampling
plans that define the kind and amount of information required to show that
the learning "product" offered by the individual meets "design" specifications
set forth in the instructional package.

It may be helpful to consider a concrete example at this point. 

As part of a recent instructional management experiment conducted at
Wisconsin's Research and Development Center, the cooperating teacher was
in the process of teaching a review of reduction of fractions to lowest
terms. This segment was to be followed by a unit new to the class: adding
simple fractions with unlike denominators.

Three criterion-referenced tests were developed independently
by different members of the project staff using specified item-generation
rules. Test A was designed to measure proficiency in reducing fractions
to lowest terms when the outcome is a common fraction. Test B measures
a similar proficiency except that the outcome is a mixed number. Test C
is a measure of proficiency in adding simple fractions. Each pretest and
posttest contained a random sample of five items selected from the item
population defined by the item generating rules. These tests together
with a brief summary of pre- and posttest results are included in Appendix A
as illustrations of one application of the model we are about to describe.

Properties of an Item-Sampling Model.

The basis of criterion-referenced test construction proposed here
is a strict item-sampling model. One first defines a specific category
of problems, either by means of item-generating rules or, if necessary,
by simply listing the entire population. This population we call a
specified content objective (SCO) inasmuch as it is the intended objective
of instruction to develop individually effective sets of problem solving
behavior relevant to the SCO.

At any given time, it is assumed that individual $a$ possesses a single
proficiency with respect to a specified content objective. A measure of
this proficiency is the individual's relative true score on the population
of items, which we label $\zeta_a$. According to this definition, proficiency
is a parameter which can be interpreted as the probability of the individual's
making an acceptable response to any item randomly selected from the
specified population.

For present purposes, a criterion-referenced test may be considered
to consist of constructed-response, binary items. Assuming local independence,
the examinee's performance on such a test may be regarded as a series of
n independent Bernoulli trials having probability of success $\zeta_a$ on each
trial, where n = number of test items.

If the examinee were to be repeatedly tested with different random
samples of size n, or if a homogeneous proficiency group (defined by
$\zeta_i = \zeta_j$ for every pair of examinees i and j) were to take the
test, the expected distribution of scores, $x_a$, would be given by the
ordinary binomial

$$f(x_a) = \binom{n}{x_a} \zeta_a^{x_a} (1 - \zeta_a)^{n - x_a}$$

According to the model, each examinee responds to an item as though he
were tossing a coin having bias $\zeta_a$.
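
As a minimal sketch of the binomial score model just described, the following Python fragment tabulates the score distribution for one examinee. The function name `score_pmf` and the example values (n = 5, proficiency .85) are illustrative assumptions, not part of the original study.

```python
from math import comb

def score_pmf(n, zeta):
    """Return [P(x = 0), ..., P(x = n)] for an examinee with proficiency zeta.

    Assumes the item-sampling model: n independent Bernoulli trials,
    each answered acceptably with probability zeta.
    """
    return [comb(n, x) * zeta**x * (1 - zeta)**(n - x) for x in range(n + 1)]

# Hypothetical example: a five-item test and a proficiency of .85.
pmf = score_pmf(5, 0.85)
expected_score = sum(x * p for x, p in enumerate(pmf))
print([round(p, 4) for p in pmf])
print(round(expected_score, 4))  # equals n * zeta under the model
```

The expected score printed at the end agrees with n times the proficiency, which is the relation used in the discussion of measurement error that follows.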

The following are well-known properties of tests built according to
an item-sampling model (Lord and Novick, 1968, p. 251). The observed test
score, $x_a$, is a sufficient statistic for estimating $\zeta_a$. Secondly,
the error of measurement is given by $e_a = x_a - n\zeta_a$. Since the
expected value of the test score is also $n\zeta_a$, it follows that the
expected error over repeated testings for a given examinee is zero. In other
words, the longer the test (within practical limits), the better the true
score estimate. Error variance is given by the usual relation

$$\sigma^2(e_a) = n\zeta_a(1 - \zeta_a)$$

for which an estimate, unbiased over item sampling, is

$$s^2(e_a) = x_a(n - x_a)/(n - 1)$$


It is interesting to note that the error of measurement is a function
only of test length, n, and the examinee's proficiency, $\zeta_a$. Therefore,
if estimation of proficiency is the essential purpose of the test, then
classical item selection techniques involving the consideration of such
item parameters as the p-value, item-test correlation, and discrimination
coefficient are of no use in the design of criterion-referenced tests
constructed according to the model proposed here.
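
The error-variance relations can be checked numerically. The sketch below, with hypothetical function names, computes the exact expectation of the estimate x(n - x)/(n - 1) over the binomial score distribution and compares it with n times the proficiency times its complement. It assumes only the item-sampling model stated above.

```python
from math import comb

def error_variance(n, zeta):
    """Error variance of the number-correct score under the model."""
    return n * zeta * (1 - zeta)

def expected_variance_estimate(n, zeta):
    """Exact expectation of x(n - x)/(n - 1) over all possible scores x."""
    total = 0.0
    for x in range(n + 1):
        prob = comb(n, x) * zeta**x * (1 - zeta)**(n - x)
        total += prob * x * (n - x) / (n - 1)
    return total

# Hypothetical values: the two quantities agree for any n > 1 and any zeta.
for n, zeta in [(5, 0.85), (5, 0.35), (20, 0.6)]:
    print(n, zeta, round(error_variance(n, zeta), 6),
          round(expected_variance_estimate(n, zeta), 6))
```

Both columns depend only on n and the proficiency, which is the point made in the paragraph above: no item parameter enters the calculation.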

Techniques of Quality Control Using Criterion-Referenced Tests.

An individual's proficiency with respect to a specified content
objective may also be regarded as a measure of product quality for a
given instructional package. Since proficiency is assumed to be a
monotonically increasing function of instructional time, a problem of interest
to the instructional manager is the estimation of proficiency at given
points in time to determine whether or not it meets some minimal criterion
of acceptance. A method of handling this decision problem is as follows.

Collecting a population of items into a specified content objective
may be compared with the industrial procedure of dividing output into
inspection lots. In the simplest sense, one may consider a semester's
work as an ordered sequence of specified objectives. All the questions
to be asked over the semester are divided into "inspection lots" or SCO's.
The quality of individual test items contained in the lot is judged, again
in the simplest case, according to the binary attribute "acceptable" or
"unacceptable" depending upon the individual's response to the item.

Each examinee possesses a particular proficiency at a given time for
producing "acceptable" items. The criterion-referenced test is regarded
in this view as a random sample of the examinee's potential production on
a given inspection lot or SCO. On the basis of the observed score, one
must decide whether to accept the lot (i.e., judge the examinee a
master) or reject the lot (decide that the individual's problem solving
behaviors must be improved). This raises the usual questions involved
in hypothesis testing concerning what size test and which criterion value
should be selected so that the errors of classification are held within
specified probability limits.

Assistance in solving this problem can be found in the vast literature 
dealing with the construction and selection of sampling plans. Time permits 
our sketching only an overview of the basic ideas here. 

A sampling plan may be defined conveniently for our purposes in one
of two ways. A single sampling plan is defined simply by selecting a value
for test length, n, and an error criterion, c. An alternative method that
is more useful when one is considering certain kinds of curtailed sampling
plans is to specify the probability of a type I error of classification,
$\alpha$, and the probability of a type II error of classification, $\beta$.
This method also requires that two additional quantities be specified: the
minimum proficiency that sets a lower bound to the mastery range, $\zeta_1$,
and the maximum proficiency that sets an upper bound to the nonmastery range,
$\zeta_2$. The range of proficiency between $\zeta_1$ and $\zeta_2$ is called
the region of indifference. Specification of $\alpha$, $\beta$, $\zeta_1$ and
$\zeta_2$ is equivalent to specifying n and c and conversely. The equations
relating these six quantities are easily derived
from the item-sampling model described earlier.
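
One way to make the relation among the six quantities concrete is a direct search for the shortest single sampling plan that meets stated error limits. The sketch below is an illustration under assumptions: the function names are hypothetical, the type I error is taken as classifying an examinee at the lower mastery bound as a nonmaster, the type II error as classifying an examinee at the upper nonmastery bound as a master, and the two error probabilities are treated as upper limits because n and c must be whole numbers.

```python
from math import comb

def mastery_probability(n, c, zeta):
    """Probability of fewer than c errors on n items at proficiency zeta."""
    return sum(comb(n, w) * zeta**(n - w) * (1 - zeta)**w for w in range(c))

def smallest_plan(alpha, beta, zeta1, zeta2, max_n=200):
    """Return the first (n, c) whose error probabilities do not exceed
    alpha at zeta1 and beta at zeta2, or None if none exists up to max_n."""
    for n in range(1, max_n + 1):
        for c in range(1, n + 1):
            type_one = 1 - mastery_probability(n, c, zeta1)
            type_two = mastery_probability(n, c, zeta2)
            if type_one <= alpha and type_two <= beta:
                return n, c
    return None

# Hypothetical limits chosen near the values discussed for the sample tests.
print(smallest_plan(0.17, 0.06, 0.85, 0.35))
```

Tightening either error limit, or narrowing the region of indifference, lengthens the plan the search returns.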

The quality control characteristics offered by a particular sampling plan
are revealed by a function called the operating characteristic (OC). The
OC simply enables one to compute the probability of deciding in favor of
acceptance or mastery as a function of the individual's true proficiency.
The general shape of the OC curve is very similar to the usual item
characteristic curves found in classical test theory. In general the
probability of an examinee being classified a master on a given SCO is a
decreasing function of his error rate, $1 - \zeta_a$. If he never makes an
error, there is a probability of one that he will be judged a master; if
he always makes errors, the probability of his classification as a master
is zero. For each error ratio between zero and one, the OC curve shows
the probability (S) of a "successful" or mastery decision being made.

If a large number of examinees of given proficiency are tested on
the same SCO, some will be classified as masters and some as nonmasters.
The OC curve can be derived from the item-sampling model by computing the
probability that an individual with any given proficiency, $\zeta$, will make
fewer than c errors. This condition for a mastery classification is given
by the cumulative probability function:

$$S = \sum_{w=0}^{c-1} \binom{n}{w} \zeta^{n-w} (1 - \zeta)^{w} \tag{1}$$

where w = n - x, the number of wrong responses made on n items. Equation
(1) is the OC for single sampling plans of size n and error criterion c.
Figures 1, 2 and 3 show representative OC curves for a number of different
plans, including one for the illustrative tests included in Appendix A.
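
Equation (1) can be tabulated directly. The following sketch computes points on the OC curve for the n = 5 plans with error criteria c = 1 and c = 2; the function name `oc_probability` and the grid of error rates are illustrative choices rather than anything specified in the paper.

```python
from math import comb

def oc_probability(n, c, error_rate):
    """Equation (1): probability of a mastery decision, i.e. of making
    fewer than c errors on n items, for a given true error rate."""
    zeta = 1.0 - error_rate
    return sum(comb(n, w) * zeta**(n - w) * error_rate**w for w in range(c))

# Tabulate the curve at error rates 0, .1, ..., 1 for the two n = 5 plans.
for c in (1, 2):
    curve = [(e / 10, round(oc_probability(5, c, e / 10), 3)) for e in range(11)]
    print(f"n=5, c={c}:", curve)
```

Each list runs from a mastery probability of one at zero error rate down to zero at an error rate of one, as described above.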



Insert Figures 1, 2 and 3 about here



From an examination of these curves, it may be seen that the OC
functions somewhat like an item characteristic curve. The value of the
criterion c roughly determines the proficiency level at which there is an
equal chance of being classified master or nonmaster. This is the region of
steepest descent for the OC curve and may be considered as the proficiency
level at which the test is maximally discriminating. The test length n
determines the steepness of the slope and hence the sharpness with which the
test discriminates between "high" and "low" proficiencies.

It should be noted that setting a higher criterion (or lower error
criterion) does not of itself improve the proficiency found in those examinees
classified as masters. In education, as in industry, quality cannot be
inspected into a product. Rather the instructional package must be improved
if higher quality is desired in the learning product.

MINIMIZING TEST LENGTH 

If the criterion-referenced test can be administered via interactive
terminals, the model suggested here is well adapted to the study of sampling
plans that minimize the number of questions required to classify students
with fixed error probabilities $\alpha$ and $\beta$.

This may appear contrary to Neyman-Pearson theory, which shows that $\alpha$
and $\beta$ depend on test length n. Nevertheless, curtailment of tests is
possible without causing a change in either $\alpha$ or $\beta$.

For example, if a single sampling plan defined by n = 5 and c = 2 were
employed, one could curtail the test as soon as two errors are observed or
as soon as 4 correct responses are noted. The final decision in each case
is exactly the same as that which would be made if the test ran to
completion. In this case, the curtailed plan and the single sample non-curtailed
plan have exactly the same OC curve and hence the same error probabilities
$\alpha$, $\beta$. What is lost by curtailment is the accuracy of estimation
of an examinee's proficiency or true score.
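
The equivalence of the two decisions can be confirmed by listing every possible response pattern. The sketch below does this for the n = 5, c = 2 plan; the function names and the coding of responses (1 for correct, 0 for an error) are assumptions made for the illustration.

```python
from itertools import product

def full_decision(responses, c):
    """Single sampling: mastery if all n responses contain fewer than c errors."""
    return responses.count(0) < c

def curtailed_decision(responses, c):
    """Stop at the c-th error or at the (n - c + 1)-th correct response.

    Returns (mastery decision, number of items actually used)."""
    n = len(responses)
    errors = correct = 0
    for used, response in enumerate(responses, start=1):
        if response:
            correct += 1
        else:
            errors += 1
        if errors == c:
            return False, used
        if correct == n - c + 1:
            return True, used
    raise AssertionError("one stopping rule always fires within n items")

patterns = list(product((0, 1), repeat=5))
same = all(full_decision(p, 2) == curtailed_decision(p, 2)[0] for p in patterns)
print(len(patterns), same)
```

Because every pattern leads to the same classification under both rules, the two plans share one OC curve; only the number of items used differs.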

A sampling plan that minimizes the expected test length for given values of
$\alpha$, $\beta$ at $\zeta_1$ and $\zeta_2$ exists. This is Wald's sequential
probability ratio test or SPRT (Wald, 1947).
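
A minimal sketch of the sequential probability ratio test for binary items follows. It uses Wald's approximate stopping bounds, which hold the error probabilities only approximately, and it assumes that mastery is the hypothesis at the lower bound of the mastery range and nonmastery the hypothesis at the upper bound of the nonmastery range. The function name and the numerical values in the example are hypothetical.

```python
from math import log

def sprt(responses, zeta1, zeta2, alpha, beta):
    """Classify from a stream of responses (1 = correct, 0 = error).

    zeta1: lower bound of the mastery range; zeta2: upper bound of the
    nonmastery range (zeta2 < zeta1). Returns (decision, items used)."""
    upper = log((1 - beta) / alpha)   # at or above: decide nonmastery
    lower = log(beta / (1 - alpha))   # at or below: decide mastery
    llr = 0.0                         # log likelihood ratio, nonmastery vs mastery
    for used, response in enumerate(responses, start=1):
        if response:
            llr += log(zeta2 / zeta1)
        else:
            llr += log((1 - zeta2) / (1 - zeta1))
        if llr >= upper:
            return "nonmaster", used
        if llr <= lower:
            return "master", used
    return "undecided", len(responses)

# Hypothetical settings near those discussed for the sample tests.
print(sprt([1, 1, 1, 1, 1, 1], 0.85, 0.35, 0.16, 0.04))
print(sprt([0, 0, 1, 1, 1, 1], 0.85, 0.35, 0.16, 0.04))
```

Unlike the curtailed single sampling plan, the test has no fixed upper limit on length, so a practical administration would need some rule for the "undecided" case.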

Sampling plans that reduce the number of questions required to reach
classification decisions without loss of protection against errors of
classification are of interest on two counts. Cost of testing is proportional
to the number of items, as is the length of time taken from instruction for
testing purposes. It is highly desirable therefore to minimize test length
while still providing for accurate decision making.

Figures 4, 5 and 6 show how the average sample number (ASN), or expected
test length, varies with error rate for curtailed tests having the same OC as
the single sample plans shown in previous figures.



Insert Figures 4, 5 and 6 about here 



The distance from the horizontal line representing the fixed sample to the
ASN is a measure of the average saving in test length for these curtailed
plans.
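
The average sample number of a curtailed plan can be computed exactly by following the probability of each (errors, correct) count until one of the stopping rules fires. The sketch below assumes the curtailment rule described earlier (stop at the c-th error or at the (n - c + 1)-th correct response); the function name is hypothetical.

```python
def average_sample_number(n, c, error_rate):
    """Expected number of items used by the curtailed (n, c) plan."""
    need_correct = n - c + 1
    states = {(0, 0): 1.0}   # (errors, correct) -> probability, test not yet stopped
    asn = 0.0
    for used in range(1, n + 1):
        next_states = {}
        for (errors, correct), prob in states.items():
            for is_error, p in ((1, error_rate), (0, 1.0 - error_rate)):
                e = errors + is_error
                k = correct + (1 - is_error)
                mass = prob * p
                if e == c or k == need_correct:
                    asn += used * mass      # testing stops after this item
                else:
                    next_states[(e, k)] = next_states.get((e, k), 0.0) + mass
        states = next_states
    return asn

# Saving relative to the fixed five-item test at several error rates.
for rate in (0.0, 0.15, 0.5, 0.65, 1.0):
    asn = average_sample_number(5, 2, rate)
    print(rate, round(asn, 3), round(5 - asn, 3))
```

The last column is the vertical distance between the fixed-length line and the ASN curve at each error rate.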

The saving that curtailed testing provides together with the growing
interest in computer generation of test items warrant further study in
connection with research on the design of economical and practical systems of
computer-based instructional management.

Appendix A

Three sample criterion-referenced pretests are shown on pages
A-1 and A-2. Posttests containing items randomly sampled from
corresponding pools are shown on A-3 and A-4. The "fail-safe" box simply
permitted pupils who felt they had no proficiency whatever on a given
set of problems to bypass the set without embarrassment. Pupils made
use of this option mainly on pretest C.

Test results for a class of 19 fifth-graders are summarized on
pages A-5 to A-7. The "high" group on each test consisted of pupils
who made fewer than two errors. Test reliabilities are relatively high
for the total group but become erratic when computed for the relatively
homogeneous proficiency subgroups. This is simply an expected consequence
of the fact that variation of scores within a subgroup is mostly error
variation.

The observed proficiency gains and transition matrices are
illustrative of the kind of management information that this type of
criterion-referenced test may provide.

The OC curve for the sampling plan employed on these tests (n = 5;
c = 2) is shown on Figure 1 in the main body of this paper. If the tests
are expected to discriminate a maximum error rate of .15 for the high
group and a minimum error rate of .65 for the low group, the OC curve
shows that the errors of classification will be approximately $\alpha$ = .16
and $\beta$ = .04 for this plan. Shifting to an error criterion of c = 1 would
greatly increase $\alpha$ but only slightly decrease $\beta$.
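
The figures quoted for this plan can be worked out from the mastery probability for single sampling plans. The calculation below is a sketch that takes the type I error as classifying a pupil with error rate .15 as a nonmaster and the type II error as classifying a pupil with error rate .65 as a master. The exact values come out close to those read from the curve.

```latex
\begin{align*}
S(\zeta) &= \zeta^{5} + 5\,\zeta^{4}(1 - \zeta) \qquad (n = 5,\ c = 2) \\
S(.85)   &= .4437 + .3915 = .8352, \qquad \alpha = 1 - S(.85) \approx .16 \\
S(.35)   &= .0053 + .0488 \approx .054, \qquad \beta = S(.35) \approx .05 \\[4pt]
S(\zeta) &= \zeta^{5} \qquad (n = 5,\ c = 1) \\
\alpha   &= 1 - .85^{5} \approx .56, \qquad \beta = .35^{5} \approx .005
\end{align*}
```

With c = 1 the type I error rises from about .16 to about .56 while the type II error falls only from about .05 to about .005, which is the trade-off described above.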




SAMPLE PRETEST (page A-1)

Name ________

Instructions: There are three parts on this test. If you decide you don't
know how to do the problems in any part, put an X in the "fail-safe"
box and go on to the next part.

Part A                                        Fail-safe: [ ]

For each fraction, find the equivalent fraction in lowest terms.

EXAMPLE: [illegible]

Problem               Response
 1.  10/36          =
 2.  18/[illegible] =
 3.  22/60          =
 4.  12/14          =
 5.  [illegible]    =

Part B                                        Fail-safe: [ ]

For each fraction, find the equivalent mixed fraction in lowest terms.

EXAMPLE: [illegible]

Problem               Response
 6.  42/12          =
 7.  63/51          =
 8.  30/12          =
 9.  [illegible]/9  =
10.  63/[illegible] =

SAMPLE PRETEST (page A-2)

Part C                                        Fail-safe: [ ]

Find the sums.

EXAMPLE:  Problem [illegible]    Solution [illegible]/30 + [illegible]/30    Response 29/30

Problem                               Solution    Response
11.  3/[illegible] + 1/8            =
12.  [illegible]/5 + 2/7            =
13.  [illegible]                    =
14.  [illegible]/7 + [illegible]/2  =
15.  [illegible]                    =




SAMPLE POSTTEST (page A-3)

Name ________

Instructions: There are three parts on this test. If you decide you don't
know how to do the problems in any part, put an X in the "fail-safe"
box and go on to the next part.

Part A                                        Fail-safe: [ ]

For each fraction, find the equivalent fraction in lowest terms.

Problem               Response
 1.  35/50          =
 2.  15/35          =
 3.  35/[illegible] =
 4.  21/30          =
 5.  10/28          =

Part B                                        Fail-safe: [ ]

For each fraction, find the equivalent mixed fraction in lowest terms.

Problem               Response
 6.  50/21          =
 7.  [illegible]/35 =
 8.  70/21          =
 9.  70/[illegible] =
10.  90/25          =




SAMPLE POSTTEST (page A-4)

Part C                                        Fail-safe: [ ]

Find the sums.

Problem                    Solution    Response
11.  6/7 + 2/9           =
12.  [illegible]         =
13.  [illegible]         =
14.  [illegible]/9 + 6/7 =
15.  2/7 + 1/5           =
TRANSITION MATRICES

[Transition matrices for Test A, Test B, and a third, unlabeled test. Rows,
under "To Level": High, Low, Initial Totals. Columns, under "From Level":
High, Low, Final Totals. Cell entries are not legible in the source.]
Figure 1. OC curves for N = 5; c = 1, 2.
Figure 2.
Figure 3.
Figure 4. Average sample number for N = 5 and error criteria c = 1, 2.
Figure 5. ASN curves for N = 20.